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We propose a solution on the stopping criterion in segment- 
ing inhomogeneous DNA sequences with complex statistical 
patterns. This new stopping criterion is based on Bayesian In- 
formation Criterion (BIG) in the model selection framework. 
When this stopping criterion is applied to a left telomere se- 
quence of yeast Saccharorayces cerevisiae and the complete 
genome sequence of bacterium Escherichia coh, borders of bi- 
ologically meaningful units were identified (e.g. subtelom- 
eric units, replication origin, and replication terminus), and 
a more reasonable number of domains was obtained. We also 
introduce a measure called segmentation strength which can 
be used to control the delineation of large domains. The re- 
lationship between the average domain size and the threshold 
of segmentation strength is determined for several genome se- 
quences. 



PACS number(s): 
02.50.Tt, 89.75Da, 



87.10.-he, 
B9.75.Fb 



87.14.Gg, 87.15.Cc, 02.50.-r, 



DNA sequences are usually not homogeneous. Re- 
gions with high concentrations of G or C bases alter- 
nate with regions which lack G or C [Q; stretches of 
sequences with an abundance of CG dinucleotide {CpG 
island) interrupt regular sequences; coding regions distin- 
guish themselves from non-coding regions by the strong 
periodicity-of-three pattern, etc. The alternation of long 
(e.g. > 300 kilobases) G-l-C rich and G-l-C poor regions 
(also known as "isochores" j^]) is shown to be related to 
chromosome bands, gene density, and perhaps chromo- 
somal structure The concepts of inhomogeneity and 
domains can also be generalized recursively to different 
length scales, and such domains- within-domains phenom- 
ena have indeed been observed in DNA sequences 
These hierarchical patterns are the cause of the fractional 
long-range correlations and 1/f spectra observed in DNA 
sequences There have been discussions of the possi- 
ble biological meaning of this hierarchical pattern and 
its connection to other complex systems . 

Computational methods used to identify homogeneous 
regions are called segmentation procedures [||J^] which 
are important for many DNA sequence analysis tasks: 
detecting the existence of isochores, identifying compli- 
cated repeat patterns within telomeres and centromeres, 
determining coding-noncoding borders etc. Segmen- 
tation procedures can also be applied to any inhomoge- 
neous/disorder media (e.g. 1-dim solid, spinglass chain) 
or nonstationary time series (e.g. symbolic dynamics) to 



determine the domain borders or turning points. An ap- 
plication of the segmentation procedure to determine the 
mobility edge of vibrational states in disordered materi- 
als can be found in . The segmentation procedure and 
the physical fragmentation pO[ | arc highly reminiscent of 
each other . The case of a segmentation procedure di- 
rectly affects the scaling exponent of the size distribution 
in a fragmentation |pd]] . 

In the segmentation procedure proposed in [|j , one cru- 
cial step - the stopping criterion - is arbitrarily deter- 
mined. This is because this criterion is presented within 
the framework of hypothesis testing. It is common in this 
framework to reject or accept the null hypothesis based 
on a chosen significance level, typically, 0.01 or 0.001. 
Not choosing other levels, say, 0.025 or 10~^, is to some 
extent arbitrary. Another practical problem of the crite- 
rion in 1^ is that it is extremely hard to halt the recursion 
at a large length scale even with a very small significance 
level, whereas many biologically interesting domains such 
as isochores are large. We solve these problems here by 
discussing segmentation in a new framework - the model 
selection framework. As a result, an alternative meaning 
of segmentation is proposed, and a minimum requirement 
for choosing one model over another is introduced. 

In the model selection framework, basic l-to-2 seg- 
mentation is carried out as a comparison of two stochas- 
tic models of the DNA sequence: before the segmenta- 
tion, the sequence is modeled by a homogeneous ran- 
dom sequence (with three base composition parame- 
ters); after the segmentation, by two homogeneous ran- 
dom sequences separated by a partition point (with 
seven parameters). Whether a l-to-2 segmentation 
should be continued or not is determined by whether the 
two-random-subsequence model is better than the one- 
random-sequence model. In model selection, the answer 
to this question is determined by two factors: first, the 
model's ability to fit the data; and second, the model's 
complexity. Overfitting and underfitting models are not 
considered to be good, either because of high model com- 
plexity or because of poor fitting performance. The 
Bayesian information criterion (BIG) is a proposal for 
balancing the two factors, defined as [l^ ]: 

BIC^~2\og{L)+log{N)K + 0{l) + 0{ ^ 
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~2\og{L) +log{N)K 



(1) 



where L is the maximum likelihood p3[ , K the number 
of parameters in the model, and N the number of data 
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points. BIC is an approximation of the logarithm of inte- 
grated hkehhood of a model multiplied by —2 The 
integrated likelihood represents the overall performance 
of a model. The better the model, the larger the inte- 
grated likelihood, and thus the smaller the BIC. A similar 
concept is the Akaike Information Criterion (AIC) [p^,^ 
with the log(7V) term in Eq.(|lJ) replaced by 2. BIC pe-g 
nalizes complex models more severely than AIC. 

We show here that the entropy-based segmentation 
in 1^ can be recast in the likelihood framework [ p^ , 
which in turn can be generalized to a model selec- 
tion framework |l^]. The likelihoods of the random- 
sequence model and the two-random-subsequence model 
(before and after a l-to-2 segmentation) are: Li{{pa}) = 

L,i{p^j,{pi},Ni) = uMr-iiM)'''^ 

where {pa} , {Pa} APa} (q!=A,C,G,T) are the base com- 
position parameters for the whole sequence, left and right 
subsequence, respectively; {Na}, {-/V^}, {NJ^} are the cor- 
responding base counts; and Ni is the size of the left 
subsequence. The maximum likelihood estimation of the 
parameters is simply pa — Na/N, and the maximum log 
likelihoods before and after segmentation are logii — 
NE and logLj = N^E^ + N"" E\ where E,E\E'' are 
the entropies for the whole, left, and right sequences. 
The segmentation position Ni is also a parameter in the 
model, and is determined by the position that maximizes 
the likelihood (though this parameter is discrete and it's 
range changes with N). The increase of log-likelihood 
is log(i2/ii) = NE - {N^E^ + N'^E'') = N ■ Djs, 
where Djs is the maximum of Jensen-Shannon diver- 
gence Djs ^E^{N^E^ + N'-E'-)/N 

We require that the BIC be reduced by the segmenta- 
tion for the procedure to continue, i.e. ABIC < 0, which 
leads to (note K2 — 7 and Ki = 3 iflT 



2iV^j5 >41og(7V). 



(2) 



Eq.(^ is our new stopping criterion. 

Lower (relaxed) bound of the signiEcance level: The 
stopping criterion in Eq.(^) differs from the criterion in 
in that the significance level cannot be arbitrarily re- 
laxed. The criterion in Q| compares the maximum Djs 
with that of a random sequence. If the sequence is in- 
deed random, 2NDjs is known to follow a distribu- 
tion , and the tail-area under this distribution is the 
corresponding significance level p^ . The new criterion 
in Eq.(||) requires that the significance level cannot be 
too relaxed. For example, if is 1 kilobase, Eq.(H) is 
equivalent to setting the significance level 1.48 x 10"*^, 
and if A^ is 1 megabases, it is 2.86 x 10^^^. The depen- 
dence of Eq. (H) on the sequence length A^ has important 
practical implications: the stopping criterion in Eq.(|^) 
is not fixed but adjustable. It is particularly important 
for a long sequence, when the criterion in Q may not be 
able to stop segmentations with large 2NDjs. 




FIG. 1. Partition points determined by the segmenta- 
tion with the stopping criterion Eq.(^ for the left telom- 
ere of yeast S. cerevisiae chromosome 12 (dashed vertical 
lines). The partition points determined by AIC (dot) (with 
the high-order term included), hypothesis testing framework 
with significance level of 0.05 (dot), 0.01 (cross), 0.001 and 
0.0001 (solid dot) are shown for comparison. Also shown is 
the G-l-C content in moving windows (window size=150 bases, 
moving distance=51 bases). The location of the telomeric se- 
quence (TEL) and subtelomeric sequences (Y' and X) are 
marked. The lower plot shows the segmentation strength s of 
a l-to-2 segmentation. The numbers are the order in which 
the segmentation is carried out. 

In Fig.|l|, we illustrate the new criterion for the left 
telomere of chromosome 12 of yeast Saccharomyces cere- 
visiae . It is known that telomere sequences are com- 
positionally complex. There is a highly repetitive se- 
quence called TEL at the tip of the telomere (for yeast, 
it is 5'-Ci_3A-3'). There are also subsequences that are 
conserved among different yeast chromosomes: the Y' 
and X subtelomeric sequence A segmentation pro- 
cedure can be applied to telomere sequences to identify 
some compositionally distinct elements |2l[| . It can be 
seen from Fig.|l| that the criterion in Eq.(^ manages to 
delineate the borders for TEL and X elements Al- 
though Eq.(||) missed the two Y' elements, an indication 
that Y' elements are not compositionally distinct, it is 
the cost of avoiding many false positives. 

Segmentation strength: Although a lower (relaxed) 
bound of the significance level is set in Eq. (H) , no limit 
on the upper (stringent) bound is possible. We introduce 
a measure for segmentation strength s pq| : 



2NDjs - 4 log(A^) 
4log(iV) ' 



(3) 



and the stringency level can be raised by choosing a non- 
zero value of the threshold sq: s > sq > 0. Eq.(|^) is 
equivalent to sq = 0. The prominence of TEL and X 
elements is indicated by their large segmentation strength 
(s = 170.66%, 84.6%, and 416.33%; see the lower plot of 
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Fig.|lD. These segmentations are also chosen earher in the 
recursive segmentation (being first, second, and third). 




FIG. 2. Segmentation points determined by Eq.(^ for 
E. coli genome (dashed vertical lines). Also shown are the 
G+C content in moving windows (window size=9000 bases, 
moving distance=357I bases), and the segmentation strength 
s. The segmentation points determined by the AlC-based 
stopping criterion are shown by the dots. The replication 
origin, replication terminus, and the 9 largest domains are 
marked in the plot. Each one of the subplots represents 1 
megabase of the sequence (total length is 4.639 megabases). 

Minimum Domain Size: To test a model on a dataset, 
the number of samples must be larger than the number of 
parameters in the model. Since we compare two models 
with 3 and 7 parameters, respectively, the sequence has 
to contain at least 7 bases before the segmentation, and 3 
bases after the segmentation. Unlike the criterion in 
these minimum size requirements are not set arbitrarily. 

Binary and 12-Symbol Sequences: For many practi- 
cal applications of the segmentation procedures, DNA 
texts are converted to symbolic sequences with less (or 
more) than four symbols. For example, the two-symbol 
sequence with symbols S (for strong, G and C) and W 
(for weak, A and T), is frequently used for studying large- 
scale homogeneous domains. The stopping criteria for bi- 
nary sequences can be modified easily: with Ki — 1 and 
K2 = 3, the right-hand-side of Eq.(||) becomes 21og(N). 
For coding region recognition, it is proposed in Q that 
a DNA sequence can be converted to a 12-symbol se- 
quence: each symbol contains information on both the 
base and the codon position (i.e. Ai, Ci, Gi, Ti, A2, • • •). 
With Ki = 9 and K2 = 19, the stopping criteria in Eq.(||) 
become 2NDjs > lOlog(N). 

Threshold for segmentation strength and domain sizes: 
Since Eq.(H) does not provide an upper (stringent) limit 
on the significance level, there is still some degree of sub- 
jectivity in our segmentation procedure. If one is inter- 
ested in largest domains, or the strongest segmentation 



signals, the threshold for segmentation strength sq should 
be set larger than zero. Taking the complete sequence of 
Escherichia coU genome for example, the replication 
origin and the replication terminus presents the two most 
significant segmentation signals. If the sq is set to 20, 
only these two l-to-2 segmentations will make the cut. 

The larger the Sq, the larger the domain sizes in the 
final configuration. The relationship between the two 
is empirically determined by segmentations on several 
genome sequences, shown in Fig.^ It can be seen that 
the relationship is not universal for all sequences: with 
the same sq, sequences with high compositional complex- 
ity (e.g. MHC sequence) contain smaller domain sizes in 
the final configuration than sequences with lower com- 
plexity (e.g. yeast). It can also be seen that in order 
to reach the average size of isochore (300 kilobases), sq 
should be set as large as 500%. 
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FIG. 3. Average domain size vs. segmentation strength 
So for these sequences: human major histocompatibility com- 
plex (MHC), A bacteriophage, chromosome 3 of S. cerevisiae, 
E. coli, left and right arms of chromosome 2 of Drosophila 
melanogaster. 

Domain size distribution: Another indirect evidence 
that our new stopping criterion is more reasonable than 
the one in j2| (with a typical significance level) can be 
seen by examining the domain size distribution in the fi- 
nal configuration. The 281 domains in the Escherichia 
coli genome in Fig.^ are ranked by size. These sizes are 
plotted against the rank (Zipf 's plot) in Fig.^ The Zipf 's 
plot for sizes from rank 4 to rank 180 approximately ex- 
hibit a power-law (Fig.||). This is similar to the 
power-law behavior in Zipf 's plot of many other natural 
and social phenomena (known as Zipf 's law |^,^ when 
the scaling exponent is close to —1). 

When a more relaxed stopping criterion is used, there 
is a lack of large domains. We illustrate this by a AIC- 
based segmentation which is equivalent to the criterion 
in §1 with the significance level of 0.091578. The Zipf's 
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plot for domains derived from the AlC-based segmenta- 
tion is not a power-law function. Even a forced curve- 
fitting by a power-law function leads to a slope merely 
~ —0.5. This indicates that the size distribution by cri- 
terion Eq.(^ is more self-similar, more balanced between 
the small and large domains than those by the AlC-based 
segmentation. 
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FIG. 4. Size-rank plot (Zipf's plot) of domains obtained 
by segmentation with the stopping criterion in Eq.(|2|). Those 
obtained by the AlC-based segmentation are also shown. 

In summary, this paper solves a problem encountered 
in that recursive segmentation is not easy to stop even 
when a stringent significance level is used (the most strin- 
gent significance level in the SEGMENT program |2^] 
is 10~^). This solution allows us to investigate much 
larger domains and longer-range hierarchical correlation 
in DNA sequences. The framework from which our so- 
lution is derived is also ideal for generalizations to other 
more complicated situations. Determining the number of 
domains in a DNA sequence, like any other descriptions 
of the sequence, is relative - it is relative to the length 
scale of interests, relative to the model used. By chang- 
ing the segmentation strength, we essentially change the 
level of description of the sequence. 

The work is supported by the grant K01HG00024 from 
NIH. I thank Jose Oliver for sending me the partition 
points used in Fig.^ produced by the SEGMENT pro- 
gram pq]. This paper is dedicated to XML. 
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